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Abstract. Parametric Embedding (PE) has recently been proposed as 
a general-purpose algorithm for class visualisation. It takes class pos- 
. teriors produced by a mixture-based clustering algorithm and projects 

them in 2D for visualisation. However, although this fully modularised 
\ combination of objectives (clustering and projection) is attractive for its 

conceptual simplicity, in the case of high dimensional data, we show that 
a more optimal combination of these objectives can be achieved by inte- 
grating them both into a consistent probabilistic model. In this way, the 
projection step will fulfil a role of regularisation, guarding against the 
curse of dimensionality. As a result, the tradeoff between clustering and 
visualisation turns out to enhance the predictive abilities of the overall 
' model. We present results on both synthetic data and two real-world 

\^ \ high-dimensional data sets: observed spectra of early-type galaxies and 

. gene expression arrays. 
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Ph! 1 Introduction 

o 

' Clustering and visualisation are two widespread unsupervised data analysis tech- 

, niques, with applications in numerous fields of science and engineering. Two key 

^ ' strategies can be distinguished in the literature: (1) One is to produce a com- 

pression of the data first and then use that to visually detect distinct clusters. 
■ A wide range of linear and nonlinear dimensionality reduction techniques devel- 

, oped in machine learning follow this route, including PC A, GTM etc. (2) The 

' alternative strategy is to cluster the data first and visualise the resulting cluster 

assignments afterwards. This is more popular in the data mining community 
A recently proposed method, termed Parametric Embedding (PE) pro- 
poses to take class posteriors produced by a mixture-based clustering algorithm 
and project them in 2D for visualisation. 

Let us observe however, that for both of these data exploration strategies 
the two objectives - clustering and visualisation - are decoupled and essentially 
one is entirely subordinated to the other. This is worrying in that inevitably, the 
errors accumulated in the first stage cannot be corrected in a subsequent stage 
and may essentially compromise the process of understanding the data. 

In this paper we consider the class visualisation problem, as in [7] and we 
identify a setting where a more fruitful coupling between clustering and visuali- 
sation can be achieved by integrating them both into a consistent probabilistic 
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model. As we shall see, this is particularly useful in high-dimensional problems, 
where the number of data dimensions exceeds the number of observations. Such 
cases are encountered in modern scientific data analysis, e.g. in gene expression 
analysis, or the analysis of special objects in astronomy. 

Our approach is based on a multi-objective formulation in a probabilistic 
formalism. As we shall see, our model can be decoupled and understood as a 
sum of two objectives: One term is responsible for clustering and one other for a 
PE-like projection of the estimated class assignments. These two steps are now 
interdependent, so that in high dimensional problems the projection step fulfils 
a regularisation role, guarding against the curse of dimensionality problem. 

We use both synthetic data and two real- world high-dimensional data sets in 
our experiments: Observed spectra (of rare quality and coverage) of early-type 
galaxies and a benchmark gene expression array data set are used to demonstrate 
the working of the proposed approach. We find that in both cases we obtain not 
only a visualisation of the mixture posteriors, but a more predictive mixture 
model, as measured by out of sample test likelihood, as a result of appropriately 
combining the objectives of clustering and class projection. 

The remainder of the paper is organised as follows: We begin with presenting 
our probabilistic model in Section 2. The interpretation by which this can be 
understood as a joint model for mixture based clustering and PE-like projection 
will become evident in Section 3, where the EM T methodology is used to 
derive a maximum a posteriori (MAP) estimation algorithm for our model. We 
then extend this work to take into account additional available knowledge on 
measurement uncertainties for real- world data analysis. Section 4 presents the 
application two very different experiments involving galaxy spectra and gene 
expression. The results are summarised in the concluding section. 

2 The model 

Consider N independent, T-dimensional data points. The n-th point is denoted 
by dn E TZ'^^^ having features dm- We seek a 2D mapping of this data into 
points Xn € TZ^^^ in the Euclidean space such as to reflect topological relation- 
ships based on some cluster structure that may be present in the data set. In 
building up our model, we begin by making the common assumption of condi- 
tional independence, in order to enforce the dependences among data features 
to be captured in the latent space. 

p{d 

n I •^n ) ='[\_p{dtn\Xn) (1) 
t 

Further, in order to capture complicated density shapes, including possibly dis- 
tinct clusters, we model the conditional probabilities of the data features as a 
mixtures of Gaussians. The mixing coefficients of these mixtures are instance- 
specific, so that each measurement that belongs to the same object will have the 
same mixing coefficient. This will ensure that the various features of an instance 



3 



are likely to be allocated to the same (set of) mixture components. 

p{dtn\Xn) = ^petk{dtn\k)Pck{k\xn) (2) 

k 

Observe we do not impose that each data point must belong to exactly one 
mixture component. This allows us to model the relationships between clusters. 

Assuming that we work with real- valued observations, and p(dtn\k) is a Gaus- 
sian, then 9tk — {fJ-tkyVik} are the mean and precision parameters respectively. 
Other choices are however possible as appropriate. 

P{dtn\k) ^ M{dtn\i^tk,Vtk) (3) 

The second factor in |(2Jl is a nonlinear function that projects a point a;„ 
from the Euclidean space onto a probability simplex. A parameterised softmax 
function can be used for this purpose. 

E.exp -cOn ^ ^ 

Our goal is then to determine Xn for each d„. In addition, we also need to 
estimate the parameters Otk and Ck- 

In order to somewhat narrow down the search space, we add smoothing 
priors, similarly to |7|: 

x„^Af{x„\0,aI); Ck^JV{ck\0,(3I); (5) 

In addition, the inverse variance parameters (precisions) are given exponential 
priors to prevent them produce singularities and encourage the extinction of 
unnecessary model complexity. 

p{vtk) = 76-^"*'= (6) 
The hyperparameters a,P and 7 must all have strictly positive values. 



3 Parameter estimation 

Here we derive MAP estimates for our model specified in the previous section. 
The complete data log likelihood is proportional to the posterior over all hidden 
variables and this is the following. 

C = ^log^pe,„(dt„|fc)PcJfc|x„) -l-^logP(cfc) -|-^logP(a;„) +^logP(«tfe) 

n.t k k n t,k 

> ^ ^ ^ rkt,i {log pg^^{dtn\k) + \ogPc^{k\Xn) - logrfct„} 

n t k 

^ log P(x„) +^ log P(cfe) +^ log P(utfc) (7) 

n k t,k 
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where we used Jensen's inequality and rktn > 0, rktn — 1 represent varia- 
tional parameters that can be obtained from maximising ijTjl: 

^ PetAdtn\k)PcAk\xn) rg. 

We can also regard k = 1, ...,K as the outcome of a hidden class variable and 
rktn are in fact true class posterior probabilities of this class variable, cf. Bayes' 
rule. 

The re-writing (|7I8|I is convenient for deriving the estimation algorithm for 
the parameters of the model. Before proceeding, let us rearrange (|7|) in two main 
terms, so that the interpretation of our model as a combination of mixture-based 
clustering and a PE-like class projection becomes evident. 

Termi = XI XI XI ^og pg^^{dtn\k) - 7^ + const. (9) 

n t k t,k 

Term2 = X X X "l^^S ik\x„) ~ log rktn 

71 t k k 71 

= X ~KL{r.^t,n\\PcX-\x„)) - aX ll^ll" - X I |c| |fc + const. (10) 

rt.t n k 

Now, the first term can be recognised as a clustering model, essentially an in- 
stance of modular mixtures j2j or an aspect-mixture of Gaussians |6I12| , which 
is known to be advantageous in high-dimensional clustering problems The 
second term, in turn, is a PE-like objective (71, which minimises the Kullback- 
Leibler (KL) divergence between the class-posteriors and their projections. Ev- 
idently, these two objectives are now interdependent. It remains to be seen in 
which cases their coupling is advantageous. 

Carrying out the optimisation of Q) yields the following maximum likelihood 
estimates for the means and maximum a posteriori estimates for the precisions. 

For the remaining parameters, there is no closed form solution, we employ 
numerical optimisation using the gradients (see Appendix): 

d 

^ = ^{Ck - Xn)^{rktn - Pck{k\Xn)) - aXn (12) 



dx 

_d 

dck 



d 

T. = ^{Xn - Cfc) ^{rktn ^ PcAMXn)) " /3Cfe (13) 



71 t 

As expected, the form of parameter updates also reflects the interdependence 
of our two objectives: lfTT|l is formally identical with the updates in |2l6ll2j (up 
to the variation implied by the use of the prior for precisions^). The gradients 
lfT^ - (fT!^ are in turn, as expected, very similar to the updates in FE 

^ In ■12. . the authors propose an improper prior for the variances. Incidentally, our 
MAP estimate for the precision parameter (lllll is formally identical to the inverse 
of their variance estimates - so we now see the expression can be derived with the 
use of a proper exponential prior on the precisions. 
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3.1 Empirical Bayesian test likelihood 

Having estimated the model, the empirical Bayesian estimate of the good- 
ness of fit for new points is given by integrating over the empirical distribution 
■i- " ^n)- This is the following. 



3.2 Visualisation 

The 2D coordinates Xn,n — 1 : N provide a visual summary of the data density. 
In addition, label markers (or colouring information), to aid the visual analysis, 
are obtained directly from Pc^(k\x). This is a handy feature of our method, 
as opposed to techniques based on dimensionality reduction methods (such as 
e.g. PC A), where detecting meaningful clusters from the projection plot is not 
straightforward. The class labels may also serve as an interface to the domain 
expert, who may wish to specify and fix the labels of certain points in order to 
explore the structure of the data interactively. 

For accommodating new data points on a visualisation produced from a 
training set, a fast 'folding in' S procedure can be used. This is to compute 

argmaxp(djg^j|a;) with all model parameters kept fixed to their estimated val- 
ues. Conveniently, this optimisation task is convex, i.e. with all other parameters 
kept fixed, the Hessian w.r.t. x is positive semidefinite 

for the same reasons as in the case of PE |2J. Therefore the projection of test 
points is unique. However, PE [J makes no mention of how to accommodate 
new points on an existing visualisation plot. 

3.3 Taking into account estimates of observational error 

It is often the case that data from science domains (e.g. astronomy) come with 
known observational errors. In this section we modify our algorithm to take these 
into account. Let atn be the standard deviation of the known measurement 
error of the i-th feature of instance n. We handle this by considering dtn as 
a hidden variable which stands for the clean data, and in addition we have 

J\fiytn\dtn,l/crt„)- 

Assuming that we are dealing with real valued data, p{dtn\xn) was defined 
as a mixture of Gaussians, and so the integration over the unseen clean data 
variable dtn gives the following likelihood term for component k of feature t: 




(14) 



n 



P{ytn\k) ^ J ddtnJ^{ytn\dtnA/o-tnW{dtn\fJ-tk,Vtk) = ^f{ytn\^J■tkA'^L + ^/'"tk) ^) 

(16) 



6 



In other words, the variance of the data hkeUhood now has two terms, one coming 
from the measurement error and one other coming from the modelhng error. The 
latter needs to be estimated. 

The estimation equations in this case modify as follows: 



J2k'-^iytn\m,(Tt„ + l/Vtk)Pcdf«\x„) 



The update equation of /itk becomes 

E„ ytnTkinl (0-?„ + 1/ Vtk) o^ 
E„ r-fet„/K„ + l/Vtfe) 

and the updates of a;„ and remain unchanged. 

For the precision parameters Vtk there is no closed form solution and so 
numerical optimisation may be employed, e.g. a conjugate gradient w.r.t. logvtk, 
since then the optimisation is unconstrained. 

d _ 1 / 1 {dtn - \ _n nQ^ 

dlogv,k-v,kY'""i'^(^tn + Uvtk) 2ial + l/v,krf ^''^ 

Observe that when atn = 0, all equations of this subsection reduce to those 
presented for the noise-free case in Sec. 3. Yet another alternative is to treat dtn 
as hidden variables and take a hierarchical EM approach. 

It should be highlighted, that although many non-probabilistic methods sim- 
ply ignore the measurement errors even when these are known, due to our prob- 
abilistic framework a principled treatment is possible. This prevents finding 'in- 
teresting' patterns in the visualisation plot as a result of measurement errors, 
at least in the cases when such errors are known. Furthermore, there are cases 
when further refinement of the noise model will be needed, e.g. in many cases 
the recorded error values are uncertain or known to be optimistic. 



4 Experiments and Applications 

4.1 Numerical simulations 

The first set of experiments is meant to demonstrate the working of our method 
and to highlight in which situations it is advantageous over the fully disjoint and 
sequential application of a mixture-based clustering and subsequent visualisation 
strategy. Illustrative cases are shown and these are important for knowing in 
what kind of problems is the method appropriate to use. 

Throughout, we used smoothness hypcrparametcrs a = {3 = 1 and 7 was 
determined by cross-validation under an initial assumption of a large (K=10) 
number of clusters. The priors on the precision parameters favour the extinction 
of unnecessary components and even if there arc remaining redundant compo- 
nents, a good-enough 7 parameter can be located. The typical value obtained 
was of the order of 10~^. Then 7 is fixed and a further cross-validation is run 
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to determine the optimal number of clusters K (less or equal to the number 
of non-empty clusters found in the previous step). We noted the optimisation 
is very sensitive to initialisation and starting /x^, from K-means is beneficial. 
To alleviate problems with local optima, each run was repeated 20 times and 
the model that found better maximum of the model likelihood was retained for 
further testing. 

Two sets of generated data were created. For the first set, 300 points were 
drawn from a 6-dimensional mixture of 5 independent Gaussians (60 points 
in each class). The second set was sampled from a 3Q0-dimensional mixture 
of 5 independent Gaussians (again, 60 points per class). Fig. l.a) shows the 
test likelihood averaged over 20 repeated runs, having generated the test data 
from the same model as the training data. The test likelihood obtained with a 
mixture of Gaussians (MoG) is superimposed for comparison. We see that for 
the relatively low dimensional data set the proposed joint model has little (no) 
advantage. This is simply because in this case there is enough data to reliably 
estimate a MoG. The obtained mixture posteriors could then safely be fed into 
e.g. a PE for class visualisation. 

For the case of high dimensional data, however the situation is different. The 
MoG overfits badly and is therefore unable to identify the clusters or to provide 
reliable input to a subsequent visualisation method. This is the situation when 
our proposed model is of use. The projection part of the objective guards against 
overfitting - as we can see from the test likelihood on Fig l.b. 




Fig. 1. Experiments on synthetic data: a) Out of sample test likelihood (higher 
is better) for the 6-dimensional data — our approach vs MoG. b) Same for the 
300-dimensional data. Our approach is significantly less prone to overfitting than 
MoG in this case, c) The final visualisation plot for the 300-dimensional data 
set. The markers are automatically assigned by the model and in this case they 
are identical with the true generator classes, d) Illustration of the visualisation 
process when the number of assumed clusters (ten) is larger than the number of 
true clusters (five). Notice the true clusters remain compact in the visualisation 
space. 
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Fig l.c shows the visuaUsation of the SOO-dimensional data set. Each point 
is the 2D representation (a;„) of the corresponding 300-D datum point. The 
markers correspond to the maximum argument of the softmax outputs P(fc|a;„), 
so they represent labels automatically assigned by the model. In this case, the 
estimated labels are identical with the true labels. We also see that the true 
number of classes has been correctly recovered. In addition, we noted that in 
experiments where the number of classes was deliberately chosen larger than the 
true number of clusters, some of the unnecessary clusters progressively become 
empty indeed, due to the employed prior on the precision parameters, while 
others split a true cluster. However, notably, the visual image of the true cluster 
split by several components tends to remain compact. Such an example is seen 
on Fig l.d. 

4.2 Visualisation of observed spectra of early-type galaxies 

We apply the method developed above to a sample of measured spectra (in the 
ultraviolet to optical range of radiation) of 21 well-studied early- type (elliptical 
or lenticular) galaxies. In a previous work j8|9j, we had studied this data set 
using various factor analysis techniques. Here, we seek to obtain a visual analysis 
of the data. Each of these spectra represent flux measurements at 348 values of 
wavelength in the range 2000-8000A, in equal bins, for all spectra. Observational 
errors are associated with each value, which we take into account as described 
in Section 3.3. Thus, the clustering and class visualisation of these spectra is a 
high-dimensional problem. 

This represents a pilot data set for an important study in the evolution 
of galaxies. It is generally believed that all early-type galaxies formed all their 
stars in the early universe, and then have evolved more-or-less passively until the 
present day- so one expects to find their spectrum to correspond to a collection 
of stars all of the same age. However, detailed observations in the last decade 
indicate a wealth of complex detail in a significant fraction of such galaxies, 
including evidence of a sub-population of very young stars in many cases. How 
common this effect is largely unknown, and can only be addressed through data 
mining of large spectral archives. Even though many x 10^ galaxy spectra are 
being assembled in large public archives (e.g. www.sdss.org), a sample as detailed 
as ours is rare and difficult to assemble, particularly with such wide a coverage 
in wavelength, which requires combining observations from both ground and 
space based observatories (see details in ^). From this small sample, we would 
attempt to isolate those galaxies which have young stars from those that don't. 

Needless to say, the fluxes are all positive values. In order to be interpretable, 
our method needs to ensure the estimated parameters (cluster prototypes) are 
also positive. In our previous work, we built in explicit constraints to ensure this 
ini- Here, since each fitk in lfTT|l is just a weighted average of positive data, its 
positivity is automatically satisfied. 

The leave-one-out test likelihood of our model is shown on the left plot of 
Fig. 2. The peak a,t K = 2 indicates that two clusters have been identified. A 
mixture of Gaussians test likelihood is superimposed for comparison, showing 
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the overfitting problem due to the high dimensionahty. The MoG is therefore 
unable to identify any clusters in the data. Hence, a class visualisation of the 
data based on mixture posteriors would be clearly compromised in this case. The 
right hand plot shows the grouping of the actual data around the two identified 
prototypes /.t;.,A; = 1,2 of our model. The latter are superimposed with thick 
lines. These can be recognised and interpreted as the prototype of the spectrum 
of a 'young' and 'mature' stellar population respectively. Thus, in this case, the 
clusters have a physical interpretation in astrophysical terms. 




Fig. 2. Left: The leave-one-out test likelihood versus the number of clusters. 
The peak at K = 2 produced by our method indicates two clusters whereas a 
mixture of Gaussians overfits and therefore fails to identify clusters in the data. 
Right: The actual data, clustered around the two prototypes identified by our 
model. The parameters fj,/. for k = 1,2 are superimposed with thick lines. They 
are interpretable as a 'young' and an 'old' prototypical spectrum respectively. 
The identification number that marks some of the spectra correspond to those 
on Fig. 3. The marked spectra are the instances that apart from their overall 
shape present some features of the 'young' category too. 



Identification numbers mark some of the spectra clustered in the 'mature' 
category on the left lower plot of Fig. 2. These are the galaxies that have a 
significantly non-zero class membership for either cluster, and they indeed in- 
clude some morphological aspects of the 'young' category as well (the emission 
lines at < 2000A and the slope of the spectral continuum in the range 6000- 
8000 A). Physically, this indicates the presence of a significant population of 
young (< 2 Gyr old) stars, whereas the rest of the stars are > 10 Gyr old. 

The identification numbers are the same as those on Fig. 3, where we see the 
2D visualisation of the sample on the left. For each spectrum y„, the 2D latent 
representation a;„ is plotted. The markers represent cluster labels automatically 
assigned by the model, as detailed in the right hand plot. We see the two clusters 
are well separated on the image and the 'hybrid' galaxies are indeed placed in 
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Galaxy instances (n^1 ,...,21 ) 



Galaxy instances (n^1 ,...,21 ) 



Fig. 3. Left: Visualisation of the data set of the spectra of 21 early-type galaxies. 
For each spectrum y„, the 2D latent representation x„ is plotted. The markers 
are those assigned by the model, as shown on the right. 



between those that are clearly cluster members. Of these, the one marked as 18 
represents a galaxy (NGC 3605) for which recent detailed physical analyses have 
been made [TO]. It turns out that although more than 85% of its stellar mass is 
associated with an old (9-12 Gyr) stellar population, it does contain a younger 
stellar population too, at ~ 1 Gyr jlO| . 

We therefore conclude that it is possible to have an intuitive visual summary 
of a few hundreds of measurements per galaxy in just two coordinates with the 
application of our method. 



4.3 Visual analysis of gene expressions 

In a brief final experiment we show the potential use of our approach for the 
visual analysis of high dimensional gene expression arrays. Oligonucleotide arrays 
can provide a means of studying the state of a cell, by monitoring the expression 
level of thousands of genes at the same time ^ and have been the focus of 
extensive research. The main difficulty is that the number of examples is typically 
of the order of tens while the number of genes is of the order of thousands. Even 
after eliminating genes that have little variation, we are still left with at least 
hundreds of data dimensions. Straightforward mixture based clustering runs into 
the well-know curse of dimensionality problem. Here we apply our method to the 
ColonCancer data set, having 40 tumour and 22 normal colon tissue samples 
This is a benchmark data set, used in many previous classification and clustering 
studies. Our input matrix consisted of the 500 genes with highest overall variation 
X the 62 samples. 

Fig. ^ shows the visualisation obtained in a purely unsupervised manner. 
The markers now correspond to the true labels (not used by the algorithm), and 
are given for the ease of visual evaluation of the representation produced. The 
separation of cancerous from noncancerous tissues is most apparent on the plot. 
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The potential of such a visuaUsation tool lies mainly in that it would allow a 
domain expert to interactively explore the structure of the sample. Additionally, 
the gene-specific posteriors rktn provide quantitative gene-level class information. 
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Fig. 4. Unsupervised class visualisation of the ColonCancer data set. The markers 
correspond to the true class for the ease of visual evaluation. 



5 Conclusions 



We proposed and investigated a model for class visualisation of explicitly high 
dimensional data. We have shown this model relates closely to PE [J] in that 
it represents a probabilistic integration of the clustering and visualisation ob- 
jectives into a single model. We derived empirical Bayesian estimates for our 
model which make this multi-objective interpretation easy to follow. Although 
this work may potentially further be enhanced by a fuller Bayesian estimation 
scheme, the empirical Bayesian methodology has been appropriate for our pur- 
poses 51] and it allows us to estimate an empirical latent density from a given 
example set of data and to reason about previously unseen data relative to that. 
We demonstrated gains in terms of the predictive capabilities of the proposed 
model over the fully modular and sequential approach to clustering and class 
visualisation in the case of high dimensional data. 
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Appendix. Estimation of Xn 

The terms containing the foUowing. 

X^X^X^^fctn < -i(x„ -Cfc)^ -log^exp(-i(a;„ 

n t k \ k' 

= Y^Y^ \ -X^r/ctni(a;„ -Cfc)^ -log^exp(-i(a;„ 

n t \ k k' 

The gradient is then: 

Renaming k" by fc and replacing the expression of Pc^ (fe|a;n) the following is obtained. 

= {-rktn{Xn - Ck) + PCk{k\^n){Xn - Ck)} - OXn 

t k 

= X^(cfc - Xn) y~^(rfctn - Pcfc(A;|x„)) - aa;„ 
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